Assignment 1

Jasper Phang Wee Keat (A0201523Y)


Resale Flat Prices Analysis

The Resale Flat Prices dataset used for Part 1 & 2 is from the Housing Development Board (HDB) and downloaded from data.gov.sg.


Part 1(a) - Fact Checking

The following code will fact-check the following claims in this ST article published in January 2022:

  1. In 2021, there have been 261 HDB flats transacted at or more than $1m, which make up 0.9 per cent of total HDB resale transactions.
  2. HDB resale prices rose 0.8 per cent in December 2021 from the previous month.
  3. HDB resale prices in December 2021 were 13.6 per cent higher than a year ago.
  4. There were 2,429 HDB flats sold in December 2021.
  5. A plot of HDB resale volume from the article will be replicated.

Required packages and data

library(tidyverse)
library(ggthemes)
library(plotly)
library(lubridate)
library(DT)
library(treemapify)
library(viridis)
library(ggridges)
library(stringr)
library(tidytext)
library(ggpol)
library(gganimate)
library(gifski)

Dataset of Resale Flat Price from January 2017 onwards

According to the metadata, the data was recently updated on 18 Feb 2022 and contains transactions between 2017-01-01 to 2022-02-17. It contains information on the month, town, flat type, floor area, lease remaining, resale price and other information.

It is also important to note that “The transactions exclude resale transactions that may not reflect the full market price such as resale between relatives and resale of part shares.” and “For March 2012 onwards, the data is based on date of registration for the resale transactions.’

resale_2017_2022 <- read_csv("data/resale-flat-prices/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

datatable(head(resale_2017_2022))

Inspect Data

We first explore the resale prices and dates to inspect if there are anomalies.The results fall within reasonable range with flats priced between $140,000 and $1,360,000 and dates corresponding to the metadata.

summary(resale_2017_2022$resale_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  140000  345000  430000  461782  544840 1360000
unique(resale_2017_2022$month)
##  [1] "2017-01" "2017-02" "2017-03" "2017-04" "2017-05" "2017-06" "2017-07"
##  [8] "2017-08" "2017-09" "2017-10" "2017-11" "2017-12" "2018-01" "2018-02"
## [15] "2018-03" "2018-04" "2018-05" "2018-06" "2018-07" "2018-08" "2018-09"
## [22] "2018-10" "2018-11" "2018-12" "2019-01" "2019-02" "2019-03" "2019-04"
## [29] "2019-05" "2019-06" "2019-07" "2019-08" "2019-09" "2019-10" "2019-11"
## [36] "2019-12" "2020-01" "2020-02" "2020-03" "2020-04" "2020-05" "2020-06"
## [43] "2020-07" "2020-08" "2020-09" "2020-10" "2020-11" "2020-12" "2021-01"
## [50] "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07" "2021-08"
## [57] "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"

Organise Data

We then rename the ‘month’ column to ‘Date’ and convert it from a character data type to date data type.

resale_2017_2022 %>%
  rename(Date = month) %>% 
  mutate(Date = ymd(Date,truncated=1)) -> resale_2017_2022

1 | Flats transacted at or more than $1m in 2021

We filter the transactions in 2021 and reorder the columns to show the date and price first.

resale_2017_2022 %>% 
  filter(year(Date) == 2021) %>% 
  arrange(resale_price) %>%   
  select(Date, resale_price, everything())  -> resale_2021

datatable(head(resale_2021,10))

We then filter transactions that were above $1m in 2021.

resale_2021  %>% 
  filter(resale_price >= 1000000) -> resale_2021_1m 

datatable(resale_2021_1m)

We can now calculate the % of units in 2021 that transacted at or more than $1m.

nrow(resale_2021_1m) / nrow(resale_2021) * 100
## [1] 0.8900955

Conclusion - 259 instead of 261 HDB flats transacted at or more than $1m which made up approximately 0.890%. The claim is false.

2 | Monthly Change in HDB Resale Price in 2021

We first calculate the average resale price during each month in 2021 and then the monthly percentage change.

resale_2021  %>% 
  mutate(resale_2021, Month = month(Date, label = TRUE)) %>%
  group_by(Month) %>% 
  summarise(mean_resale_price = mean(resale_price),
            median_resale_price = median(resale_price)) %>% 
  mutate(mean_pct_chg = (mean_resale_price - lag(mean_resale_price)) 
                        / lag(mean_resale_price) * 100) %>% 
  mutate(median_pct_chg = (median_resale_price - lag(median_resale_price))
                        / lag(median_resale_price) * 100 ) %>% 
  mutate_if(is.numeric, round, digits = 3) -> average_monthly_resale

datatable(average_monthly_resale, rownames = FALSE)

Conclusion - The mean HDB resale price dropped 0.173% while the median HDB resale price rose 0.397% in December 2021 from the previous month. The claim of 0.8% increase is false.

We can also plot the results to see the trend of HDB Resale Price in 2021.

ggplot(average_monthly_resale, aes(x=Month, group=1)) +
geom_line(aes(y=mean_resale_price, color="Mean")) + 
geom_line(aes(y=median_resale_price, color="Median")) +
scale_colour_manual(values = c("Mean" = "Red", "Median" = "Blue")) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_minimal() +
labs(x='Month', y='Price (SGD)', title="2021 Average HDB Resale Price", color="Legend", 
     caption = 'Data: HDB, data.gov.sg (2022)') +
theme(plot.title = element_text(face ='bold'),
      legend.position = 'bottom',
      axis.title.x = element_text(vjust=-2),
      axis.title.y = element_text(vjust=2),
      )

3 | Annual Change in HDB Resale Price

We first calculate the average resale price during each December in from 2017-2021 and then the annual percentage change.

resale_2017_2022  %>% 
  filter(month(Date) == 12) %>%
  group_by(year(Date)) %>%
  summarise(mean_resale_price = mean(resale_price),
            median_resale_price = median(resale_price)) %>%
  mutate(mean_pct_chg = (mean_resale_price - lag(mean_resale_price)) 
                        / lag(mean_resale_price) * 100) %>% 
  mutate(median_pct_chg = (median_resale_price - lag(median_resale_price)) 
                        / lag(median_resale_price) * 100 ) %>% 
  mutate_if(is.numeric, round, digits = 3) %>% 
  rename(Year = "year(Date)") %>% 
  mutate(Month = "Dec", .after = "Year") -> average_yearly_resale

datatable(average_yearly_resale, rownames = FALSE)

Conclusion - The mean HDB resale price increased 10.3% while the median HDB resale price rose 12.098% in December 2021 from the previous year. The claim of 13.6% increase is false.

We can also plot the results to see the trend of HDB Resale Price in December from 2017-2021

ggplot(average_yearly_resale, aes(x=paste(Month,Year), group=1)) +
geom_line(aes(y=mean_resale_price, color="Mean")) + 
geom_line(aes(y=median_resale_price, color="Median")) +
scale_colour_manual(values = c("Mean" = "Red", "Median" = "Blue")) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_minimal() +
labs(x='Year', y= 'Price (SGD)', 
     title= "Average HDB Resale Price in December", color="Legend", 
     caption = 'Data: HDB, data.gov.sg (2022)') +
theme(plot.title = element_text(face ='bold'),
      legend.position = 'bottom',
      axis.title.x = element_text(vjust=-2),
      axis.title.y = element_text(vjust=2),
)

4 | HDB Flats Sold in December 2021

resale_2021  %>% 
  filter(month(Date) == 12) -> resale_dec_2021

nrow(resale_dec_2021)
## [1] 2425

Conclusion - There were 2,425 HDB flats sold in December 2021. The claim of 2,429 is false.

5 | Replicating Plot of HDB Resale Volume

The plot from the article as shown below will be replicated.

We first filter the number of transactions for each month between Dec 2020 - Dec 2021.

resale_2017_2022 %>% 
  filter(Date >= as.Date("2020-12-01") & Date <= as.Date("2021-12-31")) %>% 
  count(Date) %>% 
  mutate(Month = month(Date, label = TRUE), .after='Date') %>%
  mutate(Year = year(Date), .after='Date') %>%
  rename(transactions = n) -> resale_dec20_dec21

We can now plot the bar column plot with our data.

resale_dec20_dec21 %>% 
  ggplot(aes(x=Date, y=transactions), group=Month) +
  geom_col(fill='lightblue', width=20) +
  geom_col(data = resale_dec20_dec21 %>% filter(Month == 'Dec'),
           fill='blue', width=20, alpha=0.25) +
  geom_text(data = resale_dec20_dec21 %>% filter(Month == 'Dec'),
            aes(label = transactions), nudge_y = 150, fontface ='bold') +
  geom_text(data = resale_dec20_dec21 %>% filter(Month != 'Dec'),
            aes(label = transactions), nudge_y=150, fontface='plain', color='darkgray') +
  labs(title = "HDB resale volume", subtitle = "No. of transactions", 
       caption = 'Data: HDB, data.gov.sg (2022)') +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y") + 
  scale_y_continuous(breaks = c(500,1000,1500,2000,2500,3000), expand = c(0,300)) +
  theme_minimal() +
  theme(plot.title = element_text(face ='bold'), 
        plot.caption = element_text(hjust = -0.06, vjust=10), 
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_text(vjust = 12, face = 
                      ifelse(levels(resale_dec20_dec21$Month) == "Jan", "bold", "plain")),
        axis.text.y = element_text(color = 'darkgray'),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()
        ) +
  geom_hline(yintercept = 0, color = "black", size = 0.5)


Part 1(b) - Further Analysis

The following code will explore 5 meaningful insights from the same dataset.

  1. Resale Price trends of different HDB Flat type over the years
  2. Change in Median Resale Prices across Towns between 2012-2021
  3. Effect of Storey Height on Resale Price
  4. Effect of Lease on Floor Area
  5. Median Resale Prices across Towns and Flat Types in 2021

Organising Data

We first import more resale prices data from 2000-2016 for time series analysis. As mentioned in the beginning, transactions before March 2012 are based on approval date instead of registration date. But that should not affect our analysis.

resale2015_2016 <- read_csv("data/resale-flat-prices/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv")
resale2012_2014 <- read_csv("data/resale-flat-prices/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv")
resale2000_2012 <- read_csv("data/resale-flat-prices/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv")
import_resale <- bind_rows(resale2000_2012, resale2012_2014, resale2015_2016)

datatable(head(resale2012_2014))
datatable(head(resale2015_2016))

For the newly imported data, We will rename the ‘month’ column to ‘Date’ and convert it from a character data type to date data type as we did before. Also, we will drop all transactions in 2022 from the dataset with transactions from 2017 onwards used in Part 1 .

Next, the remaining lease data needs to be updated. Data from 2000-2014 do not have any remaining lease column, while data from 2017 indicates year and months as characters. To simplify things, we will recalculate the remaining lease from the date of transaction and lease commencement date and overwrite the data.

import_resale %>%
  rename(Date = month) %>% 
  mutate(Date = ymd(Date,truncated=1)) -> import_resale

resale_2017_2022 <- rbind(import_resale,resale_2017_2022)
resale_2017_2022 %>%
  filter(Date <= as.Date("2021-12-31")) %>% 
  mutate(remaining_lease = 99 -(year(Date)- lease_commence_date)) -> resale_2000_2021

datatable(head(resale_2000_2021))

0 | Data Exploration

With our resale data from 2000-2021 showing the remaining lease during the transaction and proper date data we can proceed to explore the data.

resale_2000_2021  %>% 
  group_by(year(Date), flat_type) %>% 
  summarise(median_resale_price = median(resale_price), transactions=n()) %>% 
  rename(Year = "year(Date)") %>% 
  arrange(flat_type) -> median_yearly_resale_2000_2021

datatable(median_yearly_resale_2000_2021, rownames = FALSE)

We can observe from the data table that some flat types are less commonly transacted (sometimes less than 20 a year), we will plot a treemap to better understand which flat types are more commonly transacted.

ggplot(subset(median_yearly_resale_2000_2021, Year==2021),
       aes(area = transactions, fill = flat_type, label=flat_type)) +
geom_treemap() +
geom_treemap_text(colour = "white", place = "centre", size = 15) +
labs (title = "HDB resale transactions by Flat Type",
      subtitle = "No. of transactions in 2021", 
      caption = 'Data: HDB, data.gov.sg (2022)') +
scale_fill_brewer(palette = "Greys", name = "Flat Type") +
theme_minimal() +
theme(plot.title = element_text(face ='bold'))

Conclusion - We can identify that majority of HDB Resale transactions are 4 Room units, while 1 Room, 2 Room and Multi-Gen units are less commonly transacted. With this information, our analysis will focus more on 3, 4, 5 Rooms and Executive units.

2 | Change in Median Resale Prices across Towns between 2010-2021

Lets look at the transactions of 3, 4, 5 Rooms and Executives between 2000 and 2021.

resale_2000_2021  %>%
  group_by(year(Date), flat_type, town) %>% 
  summarise(median_resale_price = median(resale_price), transactions=n()) %>% 
  rename(Year = "year(Date)") %>%
  filter(flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM", "EXECUTIVE"),
          Year %in% c(2000,2021)) %>% 
  arrange(flat_type, town, Year) -> median_chg_resale_2000_2021

datatable(median_chg_resale_2000_2021, rownames = FALSE)

We can tell from the table above that we do not have sufficient transactions for some flat types and towns. (E.g 3 Room - Punggol). This may be attributed to the town being newly developed in 2000 and typically lower frequency of 3 room transactions. Thus, We will narrow our analysis to 4 Room Flats, which accounts for majority of HDB Resale transactions, and between 2010-2021 where we have sufficient data for all towns.

Next, we will calculate the percentage change in median resale prices of 4 Room flats across different towns between 2010-2021.

resale_2000_2021  %>%
  group_by(year(Date), flat_type, town) %>% 
  summarise(median_resale_price = median(resale_price), transactions=n()) %>% 
  rename(Year = "year(Date)") %>%
  filter(flat_type %in% "4 ROOM", Year %in% c(2010,2021)) %>%
  mutate(period = case_when(Year == 2010 ~ "Old", Year == 2021 ~ "New" )) %>%
  mutate(period = factor(period, levels = c("Old","New"))) %>% 
  arrange(flat_type, town, Year) %>% 
  group_by(town) %>%
  mutate(pct_chg = ((median_resale_price) - lag(median_resale_price))
                  / lag(median_resale_price) * 100) %>%
  fill(pct_chg, .direction = 'up') %>% 
  mutate_if(is.numeric, round, digits = 1) -> median_chg_4rm_2010_2021

datatable(median_chg_4rm_2010_2021, rownames = FALSE)

With the data now ready, we can visualize it in a dumbbell plot to see the change between two periods

ggplot(median_chg_4rm_2010_2021,mapping=aes(x=median_resale_price,
                                              y=reorder(town, pct_chg), 
                                              color=period, group = town)) +
geom_line(size=1.2) +
geom_point(size=2) +
labs(x='Median Resale Price', y= 'Town', 
     title = "Change in Median Resale Prices across Towns", 
     subtitle = "4 Room HDB Transactions between 2010-2021", 
     caption = 'Data: HDB, data.gov.sg (2022)') +
scale_x_continuous(breaks = seq(300000,1000000,100000) , 
                   limits = c(300000,1000000), 
                   labels = scales::dollar_format()) +
scale_color_brewer(palette = "Paired", name = "Year", labels = c("2011", "2021")) +
theme_minimal() +
theme(legend.position = "bottom",
      legend.key.height=unit(0, "cm"),
      axis.text.x = element_text(size = 10),
      axis.text.y = element_text(size = 10),
      axis.title.x = element_text(vjust=-2),
      axis.title.y = element_text(vjust=3),
      plot.title = element_text(face ='bold')
      ) +
geom_text(data = median_chg_4rm_2010_2021 %>% filter(Year == 2021), 
          aes(label = paste(pct_chg,"%"), 
              x = median_resale_price, 
              y = town), size = 3, hjust=0, nudge_x = 15000)

Conclusion - We can tell from the plot that towns which started with higher resale prices in 2011 tend to see more growth, likely because properties in these towns are highly sought after due to their central location. It is interesting that Marine Parade which started with one of the highest resale prices in 2010, fetched the least growth over the years. On the other hand, Yishun which started with the lowest resale price had significant gain over other towns.

3 | Median Resale Prices across Towns and Flat Types in 2021

We will now focus on transactions in 2021 by comparing the median resale prices across towns and flat types. To avoid the median resale values showing in scientific notation, we will set scipen penalty to 10.

resale_2021 %>%
  group_by(flat_type, town) %>% 
  summarise(median_resale_price = median(resale_price), transactions=n()) %>% 
  filter(flat_type %in% c("3 ROOM","4 ROOM","5 ROOM")) %>% 
  arrange(-median_resale_price) -> resale_2021_towns

We can plot a heatmap to compare the median resale prices across towns and flat types.

options(scipen = 10L)

(p <- ggplot(resale_2021_towns, aes(x=flat_type, y=reorder(town,median_resale_price))) +
geom_tile(aes(fill = median_resale_price)) +
geom_text(aes(label=median_resale_price), size = 2.5) +
labs(x='Flat Type', y="Town", 
     title = "Median Resale Prices across Towns and Flat Types", 
     subtitle = "HDB Resale Transactions in 2021", 
     caption = 'Data: HDB, data.gov.sg (2022)',
     fill = "Median Resale Price") +
scale_fill_gradient(low="white",high='red', labels=scales::dollar_format()) +
theme_minimal() +
theme(axis.text.x = element_text(size = 6),
      axis.text.y = element_text(size = 6),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      plot.title = element_text(face ='bold'),
      plot.margin=unit(c(1,2,1,1),"cm")
      )
)

We can try plotting the heatmap as an interactive plot with ggplotly, hover over to inspect the values.

ggplotly(p,height = 500, width=675) %>%
  layout(title = list(text = paste0('<b>Median Resale Prices across Towns and Flat Types<b>',
                                    '<br>','<sup>','HDB Resale Transactions in 2021','</sup>')),
         annotations = list(x = 0.98, y = -0.08, text = "Data: HDB, data.gov.sg (2022)",
                            showarrow = F, xref='paper', yref='paper', xanchor='right',
                            yanchor='auto', xshift=0, yshift=0,font=list(size=12))
         )

Conclusion - We can deduce the effect of location on resale prices by comparing units of different towns and of the same flat type through this heatmap. We can see that mature estates such as Central Area and Queenstown are usually most expensive. Interestingly, areas such as Pasir Ris had the highest transacted 3 Room Flat in 2021, the resale price of $442,500 can easily match the median resale price of 4 Room units in many other towns. The same can be said for the 4 Room units in towns such as Bishan and Toa Payoh which transacted for $570,000 and $620,000 respectively. These units can easily match the median resale price of 5 room units in other towns.

4 | Effect of Storey Height on Resale Price in 2021

We can also use the dataset to determine the effect of Storey Height on Resale Prices. First, lets format the remaining lease column and add a new column for Price per square metre. Comparing price per square metre will allow us to compare all units regardless of floor area.

resale_2021 %>% 
  mutate(remaining_lease = 99 -(year(Date)- lease_commence_date), 
         price_per_sqm = resale_price / floor_area_sqm) %>%
  mutate_if(is.numeric, round, digits = 2) %>% 
  select(Date, storey_range, price_per_sqm, everything())-> resale_2021

datatable(head(resale_2021))

We can now plot a boxplot of HDB Resale price per square metre by Storey Range

(box <- ggplot(resale_2021, mapping=aes(x=storey_range, y=price_per_sqm)) +
geom_boxplot(outlier.alpha = 0.1) +
labs(x='Storey Range', y= 'Price per sqm (SGD)', 
     title = "Effect of Storey Height on Resale Price in 2021", 
     subtitle = "HDB Resale Transactions in 2021", 
     caption = 'Data: HDB, data.gov.sg (2022)') +
scale_y_continuous(labels = scales::dollar_format()) +
theme_minimal() +
theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 7),
        axis.title.x = element_text(vjust=-0.5),
        axis.title.y = element_text(vjust=3),
        plot.title = element_text(face ='bold')
        )
)

We can also plot the boxplot as an interactive plot with ggplotly, hover over to inspect the values.

ggplotly(box, height = 500, width=700) %>%
  layout(title = list(text = paste0('<b>Effect of Storey Height on Resale Price in 2021<b>',
                                    '<br>','<sup>','HDB Resale Transactions in 2021','</sup>')),
         annotations = list(x = 0.98, y = -0.08, text = "Data: HDB, data.gov.sg (2022)",
                            showarrow = F, xref='paper', yref='paper', xanchor='right',
                            yanchor='auto', xshift=0, yshift=0,font=list(size=12))
         )


Conclusion - We can tell from the plot that taller units are sold at higher price/sqm. Although there are outliers ranging from 01 to 18 storey high that are sold at a higher price compared to the average, it may be attributed to other factors such as location or remaining lease, On the other hand, taller units have less outliers as these units are likely to be sold for a high price irregardless of their location and remaining lease.

5 | Effect of Lease on Floor Area and Resale Price in 2021

We can also use the same dataset to find out the effect of lease on floor area and resale prices. Since we would like to find out if newer flats are smaller than old flats, we will be focusing the analysis on 4 Room HDB for a fair comparison.

ggplot(resale_2021 %>% filter(flat_type %in% "4 ROOM"), 
       aes(x=remaining_lease, y=floor_area_sqm, color=price_per_sqm)) + 
geom_smooth(alpha=0.5, color='pink') +
geom_point(size=1) +
labs(x='Remaining Lease', y= 'Floor Area (sqm)', 
     title = "Effect of Lease on Floor Area and Resale Price", 
     subtitle = "4 Room HDB Resale Transactions in 2021", 
     caption = 'Data: HDB, data.gov.sg (2022)',
     color = "Price per sqm") +
scale_colour_continuous(labels=scales::dollar_format(), type='viridis')+
scale_x_continuous(breaks = seq(45,95,5)) +
theme_minimal() +
theme(legend.position = "right",
      axis.text.x = element_text(size = 9),
      axis.text.y = element_text(size = 7),
      axis.title.x = element_text(vjust=-0.5),
      axis.title.y = element_text(vjust=3),
      plot.title = element_text(face ='bold'),
      panel.grid.minor.x = element_blank(),
      aspect.ratio=1/2
      )

resale_2021 %>% 
  filter(flat_type %in% "4 ROOM") %>% 
  select(floor_area_sqm, remaining_lease, price_per_sqm,
         resale_price,block,street_name, flat_model,everything()) %>%
  arrange(remaining_lease) -> resale_2021_4rm

datatable(head(resale_2021_4rm,10),
          caption = 'Least remaining lease')
resale_2021_4rm %>% 
  arrange(-price_per_sqm) -> resale_2021_4rm

datatable(head(resale_2021_4rm,10),
          caption = 'Highest price per square metre')

Conclusion - We can tell from the plot that flats with 60-80 years remaining lease are generally larger (100-120sqm) than newer flats with 80-90 years remaining (90-100sqm). Also, there was a drastic decrease in size for flats with 45-50 years lease remaining. According to the tables above, some were adjoined flats while other are terraces built by the Singapore Improvement Trust. Interestingly, these larger terrace houses ($930,000 | $7750/sqm) is cheaper when compared to the smaller but newer 4 Room units in PinnacleDuxton ($1,200,000 | $12903/sqm).


Part 2 - Recreating a new Plot Type

Part 2 will recreate the ridgeline plot below using the same HDB Resale Flat Prices dataset in Part 1.

The ridgeline plot shows the distribution of a numeric value for multiple groups in the data. In order to remain consistent with other sections, we will be referring to neighbourhoods as towns.

1 | Recreating the plot

We will first need to transform the data to show the 2021 HDB Resale Prices in price per square metre grouped by the various towns. Since we have already calculated the price per square metre earlier to analyse the effect of Storey Height on Resale Price, we will continue to use that dataset. We will also apply the str_to_tile function from stringr library to convert the first letter of every town name to uppercase.

resale_2021 %>% 
  mutate(town = str_to_title(town)) %>% 
  ggplot(aes(x=price_per_sqm, y = reorder(town,-price_per_sqm), fill=stat(x))) +
  geom_density_ridges_gradient(rel_min_height = 0.005, scale = 1.5, 
                               color='white',lwd = 1, show.legend = FALSE) +
  labs(x='Price per square metre (SGD)', y='Town',
     title = "HDB resale prices in 2021 by Town", 
     subtitle = "Towns exhibit large differences", 
     caption = 'Data: Housing and Development Board, Singapore,2022.') +
scale_fill_viridis(option="C") +
scale_x_continuous(breaks = seq(2500, 12500,2500)) +
xlim (c (2500,12500)) +
theme_minimal() +
theme(axis.text.x = element_text(size = 12),
      axis.text.y = element_text(size = 12),
      axis.title.x = element_text(size = 14, vjust=0),
      axis.title.y = element_text(size = 14, vjust=0),
      plot.title = element_text(size = 18),
      plot.subtitle = element_text(size= 16),
      plot.caption = element_text(size = 12),
      panel.grid = element_blank()
      ) 

2 | Pushing it further

Above is the recreated ridgeline plot. We can push this plot further by visualising the distribution of prices by flat types. Similar to the previous parts, we will be focusing on 3,4,5 Room and Executive HDB units. We will do some adjustments to the axes and add a vertical lines at the 5th, 50th and 95th percentile to understand the distribution across towns.

resale_2021 %>% 
  filter(flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM", "EXECUTIVE")) %>% 
  mutate(f_town = reorder_within(town, price_per_sqm, flat_type)) %>% 
    ggplot(aes(x=price_per_sqm, y = reorder(f_town,-price_per_sqm), fill=stat(x))) +
  geom_density_ridges_gradient(rel_min_height = 0.005, scale = 1, 
                               quantile_lines= TRUE, quantiles = c(0.05, 0.5, 0.95),
                               color='white',lwd = 1, show.legend = FALSE) +
  labs(x='Price per square metre (SGD)', y='Town',
     title = "HDB resale prices in 2021 by Towns and Flat Types", 
     subtitle = "Towns exhibit large differences", 
     caption = 'Data: HDB, data.gov.sg (2022)') +
scale_y_reordered() +
scale_fill_viridis(option="C") +
scale_x_continuous(breaks = seq(2500, 12500,2500),
                   limits = c(2500,12500),
                   labels = scales::dollar_format()) +
theme_minimal() +
theme(axis.text.x = element_text(size = 12),
      axis.text.y = element_text(size = 12),
      axis.title.x = element_text(size = 14, vjust=-3),
      axis.title.y = element_text(size = 14, vjust=3),
      plot.title = element_text(size = 18, face="bold"),
      plot.subtitle = element_text(size= 16),
      plot.caption = element_text(size = 12),
      strip.text = element_text(size = 14, face="bold"),
      panel.spacing = unit(2, "lines"),
      panel.grid.major.x = element_line(color='grey75', size=1),
      panel.grid.minor.x = element_line(color='grey90', size=1),
      panel.grid.major.y = element_blank()
      ) +
facet_wrap(~flat_type, ncol=2, scales="free")

Conclusion - We can see how effective ridgeline plots are in showing distribution of multiple groups. Ridgeline plots are especially effective in showing a clear pattern and ranking, in this case the towns which HDBs are priced higher than others. By breaking it down further to flat types, we can understand the distribution of each town better. For instance, We can also observe that the prices of Pasir Ris 3 Room units are more widely spreaded out when compared to 4 Room, 5 Room and Executive units in the same town. Thus, 3 Room units in Pasir Ris fetched the highest prices only behind units in the Central Area, even though it had one of the lowest overall resale prices in 2021.


Demographic Analysis

For Part 3 we will be analyzing the demographics dataset of Singapore Residents by Subzone and Type of Dwelling. We can find the 2011-2019 dataset from data.gov.sg but SingStat offers a wider range of data from 2000-2020 under Population Trends.


Part 3 - Freeform Task

Dataset of Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and June 2011-2020

According to the metadata, the data was last updated on 24 Sep 2020 by the Singapore Department of Statistics and contains population data from June 2000-2020. It contains information on the Planning Area, Subzone, Age Group, Sex, Type of Dwelling, Resident count and Year.

It is also important to note that data has been rounded to the nearest 10 and may not add up due to rounding. Also, the planning areas referred to have changed over the years in accordance to the Urban Redevelopment Authority’s Master Plan demarcation of areas.

  • For June 2000: Master Plan 1998
  • For June 2001-2010: Master Plan 2008
  • For June 2011-2019: Master Plan 2014
  • For June 2020 onwards : Master Plan 2019

As we review the Master Plan Planning Area Boundaries of 1998, 2008, 2014, 2019. We can see that the planning areas were largely unchanged but expanded over time as Singapore reclaimed more land. Below is a photo of the maps overlay.

pop_2000_2010 <- read_csv("data/respopagesextod2000to2010.csv") 
pop_2011_2020 <- read_csv("data/respopagesextod2011to2020.csv")
pop <- rbind(pop_2000_2010,pop_2011_2020)

datatable(head(pop_2000_2010))
datatable(head(pop_2011_2020))

Inspect Data

After combining the two datasets from 2000-2010 and 2011-2020, we realise that some subzones such as “Ang Mo Kio Town Centre” were introduced at a later date. It is likely that the planning area has not changed over the years but subzones were added. This is important to note if we are comparing data between subzones. We will identify which subzones are newly created.

pop %>% 
  group_by(SZ) %>% 
  summarise(starting_year = min(Time)) %>%
  filter (starting_year >= '2001') %>% 
  arrange (desc(starting_year)) -> sz_created

datatable(sz_created, rownames = FALSE)

From the datatable above, we can clearly see that subzones were introduced in 2001, 2011 and 2020 in line with the change in demarcation of Master Plan 2008, 2014 and 2019. Thus, any analysis regarding subzones should only be comparing data between 2001-2010 or 2011-2019. This may be the reason why the dataset from data.gov.sg mentioned earlier only provides data between 2011-2019.

We will check if any planning areas were created at a later date.

unique(pop$PA)
##  [1] "Ang Mo Kio"              "Bedok"                  
##  [3] "Bishan"                  "Boon Lay/Pioneer"       
##  [5] "Bukit Batok"             "Bukit Merah"            
##  [7] "Bukit Panjang"           "Bukit Timah"            
##  [9] "Central Water Catchment" "Changi"                 
## [11] "Changi Bay"              "Choa Chu Kang"          
## [13] "Clementi"                "Downtown Core"          
## [15] "Geylang"                 "Hougang"                
## [17] "Jurong East"             "Jurong West"            
## [19] "Kallang"                 "Lim Chu Kang"           
## [21] "Mandai"                  "Marina East"            
## [23] "Marina South"            "Marine Parade"          
## [25] "Museum"                  "Newton"                 
## [27] "North-Eastern Islands"   "Novena"                 
## [29] "Orchard"                 "Outram"                 
## [31] "Pasir Ris"               "Paya Lebar"             
## [33] "Punggol"                 "Queenstown"             
## [35] "River Valley"            "Rochor"                 
## [37] "Seletar"                 "Sembawang"              
## [39] "Sengkang"                "Serangoon"              
## [41] "Simpang"                 "Singapore River"        
## [43] "Southern Islands"        "Straits View"           
## [45] "Sungei Kadut"            "Tampines"               
## [47] "Tanglin"                 "Tengah"                 
## [49] "Toa Payoh"               "Tuas"                   
## [51] "Western Islands"         "Western Water Catchment"
## [53] "Woodlands"               "Yishun"                 
## [55] "Not Stated"              "Boon Lay"               
## [57] "Pioneer"
pop %>% 
  group_by(PA) %>% 
  summarise(starting_year = min(Time)) %>% 
  arrange (desc(starting_year)) -> pa_created

datatable(pa_created, rownames = FALSE)

From the datatable above, we observe that Boon Lay/Pioneer was split in 2001. There is also a category for “Not Stated” which will need to be dropped.

pop %>% 
  filter(PA == "Not Stated") -> pop_not_stated

unique(pop_not_stated$Time)
## [1] 2000

We can see from above that “Not Stated” only affects data in 2000. We will not remove data affected by the splitting of Boon Lay/Pioneer and Not Stated as we may need the total population count for our analysis. Thus, any analysis regarding planning areas should only be comparing data between 2001-2020.

Next, we will check the other categories of Age Group, Sex, Year and Types of Dwellings to ensure they fall within reasonable range.

unique(pop$AG)
##  [1] "0_to_4"      "5_to_9"      "10_to_14"    "15_to_19"    "20_to_24"   
##  [6] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"   
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"
unique(pop$TOD)
## [1] "HDB 1- and 2-Room Flats"                
## [2] "HDB 3-Room Flats"                       
## [3] "HDB 4-Room Flats"                       
## [4] "HDB 5-Room and Executive Flats"         
## [5] "HUDC Flats (excluding those privatised)"
## [6] "Landed Properties"                      
## [7] "Condominiums and Other Apartments"      
## [8] "Others"
unique(pop$Sex)
## [1] "Males"   "Females"
unique(pop$Time)
##  [1] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## [16] 2015 2016 2017 2018 2019 2020

Organise Data

We then rename the ‘time’ column to ‘Year’, we will also replace age group values to double digits and format it as XX-XX and XX+.

pop %>% 
  rename(Year = Time, Gender = Sex) -> pop
pop$AG <- pop$AG %>%
  replace(pop$AG == "0_to_4", "00_to_04") %>%
  replace(pop$AG == "5_to_9", "05_to_09")

pop$AG <- str_replace(pop$AG, "_to_","-")
pop$AG <- str_replace(pop$AG, "_and_over","+")
unique(pop$AG)
##  [1] "00-04" "05-09" "10-14" "15-19" "20-24" "25-29" "30-34" "35-39" "40-44"
## [10] "45-49" "50-54" "55-59" "60-64" "65-69" "70-74" "75-79" "80-84" "85-89"
## [19] "90+"

1 | Population Pyramid 2000-2020

We can start with a population pyramid to visualize the distribution of the population between different age group and gender. To generate a population pyramid, we will set the Male population to negative.

pop %>% 
  mutate(Pop = ifelse(Gender == "Males",-1*Pop,Pop)) %>% 
  group_by(Year,AG, Gender) %>%
  mutate(Gender = factor(Gender, levels = c("Males", "Females"))) %>% 
  summarise (Pop = sum(Pop)) -> pop_pyr
ggplot(pop_pyr %>% filter(Year==2020), 
       aes(x = AG, y = Pop / 1000, fill = Gender)) +
geom_col() +
coord_flip() +
facet_share(~Gender, dir = "h", scales = "free", reverse_num = TRUE) +
labs(y='Population (thousands)',
     title = "Resident Population in 2020", 
     subtitle = "Age-Gender Ratio in Singapore", 
     caption = 'Data: Singapore Department of Statistics (2020)') +
theme(plot.background = element_blank(),
     axis.ticks = element_line(color='white'),
     axis.title.y = element_blank(),
     panel.background = element_blank(),
     panel.border = element_blank(),
     strip.background = element_blank(),
     strip.text.x = element_text(size = 12,
                                 face ='bold'),
     panel.grid.minor = element_blank(),
     panel.grid.major.x = element_line(color='grey80'), 
     axis.text = element_text(size = 12),
     legend.position = "none",
     plot.title = element_text(size = 18,
                               hjust = 0.5,
                               face = 'bold'),
     plot.subtitle = element_text(size = 12,
                                  hjust = 0.5),
     axis.title.x = element_text(size = 12,
                                 face = 'bold',
                                 margin = margin(t = 10)),
     plot.caption = element_text(size = 10,
                                 hjust = 0.5),
     aspect.ratio = 3/2
      )

We can try to plot this in as an animated plot.

pop_pyr_plot <- 
ggplot(pop_pyr, aes(x = AG, y = Pop / 1000, fill = Gender)) +
geom_col() +
coord_flip() +
facet_share(~Gender, dir = "h", scales = "free", reverse_num = TRUE) +
theme(plot.background = element_blank(),
     axis.ticks = element_line(color='white'),
     axis.title.y = element_blank(),
     panel.background = element_blank(),
     panel.border = element_blank(),
     strip.background = element_blank(),
     strip.text.x = element_text(size = 24,
                                 face ='bold'),
     panel.grid.minor = element_blank(),
     panel.grid.major.x = element_line(color='grey80'), 
     axis.text = element_text(size = 24),
     legend.position = "none",
     plot.title = element_text(size = 36,
                               hjust = 0.5,
                               face = 'bold'),
     plot.subtitle = element_text(size = 24,
                                  hjust = 0.5),
     axis.title.x = element_text(size = 24,
                                 face = 'bold',
                                 margin = margin(t = 10)),
     plot.caption = element_text(size = 20,
                                 hjust = 0.5),
     aspect.ratio = 3/2
      ) + 
  labs(y='Population (thousands)',
       title = "Resident Population in {closest_state}", 
       subtitle = "Age-Gender Ratio in Singapore", 
       caption = 'Data: Singapore Department of Statistics (2020)') +
  transition_states(Year,
                    transition_length = 1,
                    state_length = 1) + 
  enter_fade() +
  exit_fade() + 
  ease_aes('cubic-in-out')

animate(pop_pyr_plot,
        fps = 24,
        duration = 20,
        width = 1000,
        height = 1000,
        renderer = gifski_renderer('pop_pyr.gif')
       )

Conclusion - We can observe from the animated population pyramid, the effect of Singapore’s aging population which distribution of age groups trend towards older age groups. We can also deduce the increase in life expectancy (80-90+) and reduction in fertility rate (00-04).

2 | Dependency Ratio by Planning Area

We will now focus on analyzing the old-age and child dependency ratio to compare between different planning areas.The dependency ratio is the ratio of the number of children (aged 14 and below) or elderly people at an age when they are generally economically inactive (aged 65 and over), compared to the number of people of working age (15-64 years old).

First, we will perform data wrangling to categorize the age groups into young, working and old. Then we will narrow our analysis to 2011-2019 for the reasons mentioned before.

pop %>% 
  mutate(AG = case_when( str_detect(AG, "00|05|10") ~ "young",
                         str_detect(AG, "15|20|25|30|35|40|45|50|55|60") ~ "working",
                         str_detect(AG, "65|70|75|80|85|90") ~ "old")) %>% 
  filter(Year == 2011 | Year == 2019) %>% 
  group_by(PA, Year, AG) %>% 
  summarise(Pop = sum(Pop)) %>% 
  filter(Pop != 0) -> pop_depend

datatable(pop_depend)
unique(pop_depend$AG)
## [1] "old"     "working" "young"
unique(pop_depend$PA)
##  [1] "Ang Mo Kio"              "Bedok"                  
##  [3] "Bishan"                  "Bukit Batok"            
##  [5] "Bukit Merah"             "Bukit Panjang"          
##  [7] "Bukit Timah"             "Changi"                 
##  [9] "Choa Chu Kang"           "Clementi"               
## [11] "Downtown Core"           "Geylang"                
## [13] "Hougang"                 "Jurong East"            
## [15] "Jurong West"             "Kallang"                
## [17] "Lim Chu Kang"            "Mandai"                 
## [19] "Marine Parade"           "Museum"                 
## [21] "Newton"                  "Novena"                 
## [23] "Orchard"                 "Outram"                 
## [25] "Pasir Ris"               "Punggol"                
## [27] "Queenstown"              "River Valley"           
## [29] "Rochor"                  "Seletar"                
## [31] "Sembawang"               "Sengkang"               
## [33] "Serangoon"               "Singapore River"        
## [35] "Southern Islands"        "Sungei Kadut"           
## [37] "Tampines"                "Tanglin"                
## [39] "Toa Payoh"               "Western Water Catchment"
## [41] "Woodlands"               "Yishun"

We notice that some of the areas such as Central Water Catchment (Nature Reserve), North-Eastern Islands, Paya Lebar (Air Base) have been dropped because they do not have population.

Next, with the transformed data we can now calculate and plot the Child Dependency Ratio and Old-Age Dependency Ratio. We will start by calculating the overall ratio and then ratio for each planning area.

pop_depend %>% 
  group_by(Year) %>% 
  summarise(old = sum(Pop[AG == "old"]), working = sum(Pop[AG == "working"])) %>%
  group_by(Year) %>% 
  summarise(overall_aged_ratio = old/working) %>% 
  mutate_if(is.numeric, round, digits = 3) -> pop_overall_aged_ratio

datatable(pop_overall_aged_ratio, rownames = FALSE)
pop_depend %>% 
  group_by(PA,Year) %>% 
  summarise(aged_ratio = Pop[AG == "old"] / Pop[AG == "working"]) %>% 
  group_by(PA) %>%
  mutate(pct_chg = ((aged_ratio) - lag(aged_ratio)) /lag(aged_ratio) * 100) %>% 
  fill(pct_chg, .direction = 'up') %>%
  mutate(period = case_when(Year == 2011 ~ "Old", Year == 2019 ~ "New" )) %>%
  mutate(period = factor(period, levels = c("Old","New"))) %>% 
  mutate_if(is.numeric, round, digits = 3) %>% 
  filter (!is.na(pct_chg))-> pop_aged_ratio

datatable(pop_aged_ratio)
pop_depend %>% 
  group_by(Year) %>% 
  summarise(young = sum(Pop[AG == "young"]), working = sum(Pop[AG == "working"])) %>%
  group_by(Year) %>% 
  summarise(overall_child_ratio = young/working) %>% 
  mutate_if(is.numeric, round, digits = 3) -> pop_overall_child_ratio

datatable(pop_overall_child_ratio, rownames = FALSE)
pop_depend %>% 
  group_by(PA,Year) %>% 
  summarise(child_ratio = Pop[AG == "young"] / Pop[AG == "working"]) %>% 
  group_by(PA) %>%
  mutate(pct_chg = ((child_ratio) - lag(child_ratio)) /lag(child_ratio) * 100) %>% 
  fill(pct_chg, .direction = 'up') %>% 
  mutate(period = case_when(Year == 2011 ~ "Old", Year == 2019 ~ "New" )) %>%
  mutate(period = factor(period, levels = c("Old","New"))) %>%
  mutate_if(is.numeric, round, digits = 3)-> pop_child_ratio

datatable(pop_child_ratio)
ggplot(pop_aged_ratio,mapping=aes(x=aged_ratio, y=reorder(PA, pct_chg), 
                                              color=period, group = PA)) +
geom_vline(xintercept = pop_overall_aged_ratio$overall_aged_ratio, 
           linetype="dashed", color = c("#A6CEE3","#1F78B4"), size=0.75) +
geom_line(size=1.4) +
geom_point(size=2.5) +
labs(x='Old Age Dependency Ratio', y= 'Planning Area', 
     title = "Change in Old Age Depedency Ratio across Planning Areas", 
     subtitle = "People aged 65 and above per 100 people in the workforce", 
     caption = 'Data: Singapore Department of Statistics (2020)') +
scale_x_continuous(breaks = seq(0,0.5,0.1) , 
                   limits = c(-0.05,0.5), expand =c(0.025,0)) +
scale_color_brewer(palette = "Paired", name = "Year", labels = c("2011", "2019")) +
theme_minimal() +
theme(legend.position = "bottom",
      legend.key.height=unit(0, "cm"),
      axis.text.x = element_text(size = 10),
      axis.text.y = element_text(size = 10),
      axis.title.x = element_text(vjust=-2),
      axis.title.y = element_text(vjust=3),
      plot.title = element_text(face ='bold')
      ) +
geom_rect(mapping = aes(xmin = -0.02, xmax = -Inf , ymin = -Inf, ymax = Inf),
    fill = "white",
    color ="white") +
geom_text(data = pop_aged_ratio %>% filter(Year == 2019), 
          aes(label = paste(round(pct_chg, digits=1),"%"), 
              x = -0.05, y = PA), size = 3, hjust=0, nudge_x = 0)

ggplot(pop_child_ratio,mapping=aes(x=child_ratio, y=reorder(PA, pct_chg), 
                                              color=period, group = PA)) +
geom_vline(xintercept = pop_overall_child_ratio$overall_child_ratio, 
           linetype="dashed", color = c("#A6CEE3","#1F78B4"), size=0.75) +
geom_line(size=1.4) +
geom_point(size=2.5) +
labs(x='Child Dependency Ratio', y= 'Planning Area', 
     title = "Change in Child Depedency Ratio across Planning Areas", 
     subtitle = "People aged 14 and below per 100 people in the workforce", 
     caption = 'Data: Singapore Department of Statistics (2020)') +
scale_x_continuous(breaks = seq(0,0.5,0.1) , 
                   limits = c(-0.05,0.5), expand =c(0.025,0)) +
scale_color_brewer(palette = "Paired", name = "Year", labels = c("2011", "2019")) +
theme_minimal() +
theme(legend.position = "bottom",
      legend.key.height=unit(0, "cm"),
      axis.text.x = element_text(size = 10),
      axis.text.y = element_text(size = 10),
      axis.title.x = element_text(vjust=-2),
      axis.title.y = element_text(vjust=3),
      plot.title = element_text(face ='bold')
      ) +
  geom_rect(mapping = aes(xmin = -0.02, xmax = -Inf , ymin = -Inf, ymax = Inf),
    fill = "white",
    color ="white") +
geom_text(data = pop_child_ratio %>% filter(Year == 2019), 
          aes(label = paste(round(pct_chg, digits=1),"%"), 
              x = -0.05, y = PA), size = 3, hjust=0)

pop_depend %>% 
  group_by (PA,Year) %>%
  summarise(total_pop = sum(Pop)) %>% 
  arrange(total_pop) -> pop_pa

datatable(pop_pa)

Conclusion - We can observe from the dumbbell plots that there has been an overall increase in old age dependency across Singapore. On the other hand, Child dependency ratio has been decreasing across Singapore likely due to the lower fertility rate. However, we can see that towns like Punggol still saw an increase. This is likely because it is a new town with large number of young families. We also noticed that both plots have some planning areas with very large changes, this is likely due to their lower population count. For example, Museum which saw an increaseof 464% only had a population of 250 in 2011 and 450 in 2019.